Comparison of Word-Based and Syllable-Based Retrieval for Tibetan
Abstract
Tibetan retrieval based on automatically segmented words is compared with the use of overlapping syllable n-grams using a known-item retrieval evaluation. The optimal span of fixed-length n-grams is found to be 2 syllables, and indexing words is found to be as effective as indexing syllable bigrams.
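The overlapping syllable n-grams mentioned in the abstract can be illustrated with a minimal sketch. It assumes that syllables are delimited by the tsheg mark (U+0F0B) and that shad marks and whitespace also end a syllable. The function names, the delimiter set, and the fallback for short inputs are illustrative assumptions, not the paper's implementation.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)

# Assumed delimiter set: tsheg, non-breaking tsheg, shad, double shad, whitespace.
_DELIMS = re.compile(r"[\u0f0b\u0f0c\u0f0d\u0f0e\s]+")


def syllables(text):
    """Split text into syllables on the assumed delimiter set."""
    return [s for s in _DELIMS.split(text) if s]


def syllable_ngrams(text, n=2):
    """Return overlapping syllable n-grams, each rejoined with a tsheg.

    Simplification: n-grams are allowed to span shad (clause) boundaries.
    Inputs shorter than n syllables yield one shorter term so that they
    can still be indexed (an assumption made for this sketch).
    """
    syls = syllables(text)
    if not syls:
        return []
    if len(syls) < n:
        return [TSHEG.join(syls)]
    return [TSHEG.join(syls[i:i + n]) for i in range(len(syls) - n + 1)]


if __name__ == "__main__":
    # Three syllables produce two overlapping bigrams.
    for term in syllable_ngrams("བོད་སྐད་ཡིག", n=2):
        print(term)
```

With n=2 this produces the syllable bigrams that the abstract reports as the optimal fixed-length span. A word-based index would instead require an automatic segmenter, which is not sketched here.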
Similar resources
Tibetan Multi-word Expressions Identification Framework Based on News Corpora
This paper presents an identification framework for extracting Tibetan multi-word expressions. The framework includes two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao’s N-gram statistical algorithm and Statistical Substring Reduction Algorithm. In the second phase, the Tibetan MWEs are identified by the proposed framework wh...
Using syllable-based indexing features and language models to improve German spoken document retrieval
Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to constitute a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used t...
Stemming and Segmentation for Classical Tibetan
Tibetan is a monosyllabic language for which computerized language tools are largely lacking. We describe the development of a syllable stemmer for Tibetan. The stemmer is based on a set of rules that strive to identify the vowel, the core letter of the syllable, and then the other parts. We demonstrate the value of the stemmer with two applications: determining stem similarity of two syllables...
Connected Component Based Word Spotting on Persian Handwritten image documents
Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...
Syllable-based Language Models in Speech Recognition for English Spoken Document Retrieval
The spoken content of audio/visual collections such as TV or radio archives is an information resource of enormous potential. The challenge is to develop methods that will make it possible to browse or search these collections. The experimental results presented in this paper demonstrate that syllable-level transcripts provide an important supplement to conventional word-level transcripts for t...